Whittle index
Finite-Time Analysis of Whittle Index based Q-Learning for Restless Multi-Armed Bandits with Neural Network Function Approximation
The Whittle index policy is a heuristic for the intractable restless multi-armed bandit (RMAB) problem. Although it is provably asymptotically optimal, computing Whittle indices remains difficult. In this paper, we present Neural-Q-Whittle, a Whittle-index-based Q-learning algorithm for RMABs with neural network function approximation, which is an instance of nonlinear two-timescale stochastic approximation: Q-function values are updated on a faster timescale and Whittle indices on a slower one. Despite the empirical success of deep Q-learning, the non-asymptotic convergence rate of Neural-Q-Whittle, which couples neural networks with two-timescale Q-learning, remains largely unclear. This paper provides a finite-time analysis of Neural-Q-Whittle, where data are generated from a Markov chain and the Q-function is approximated by a ReLU neural network. Our analysis leverages a Lyapunov drift approach to capture the evolution of the two coupled parameters, and the nonlinearity in value function approximation further requires us to characterize the approximation error. Combining these yields an $\mathcal{O}(1/k^{2/3})$ convergence rate for Neural-Q-Whittle, where $k$ is the number of iterations.
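The two-timescale structure described above can be illustrated with a minimal tabular sketch (not the neural version the paper analyzes): the Q-function is updated with a fast step size, while the Whittle index estimate for a reference state moves on a slower one. The discount factor, step-size schedules, and uniform exploration below are assumptions for illustration only.

```python
import numpy as np

def qwi_two_timescale(P, R, s_ref, iters=5000, seed=0):
    """Tabular sketch of two-timescale Whittle-index Q-learning.

    P[a][s, s'] : transition probabilities under action a in {0 (passive), 1 (active)}
    R[s, a]     : mean rewards; the learned subsidy lam plays the Whittle-index role
    s_ref       : reference state whose Whittle index is estimated
    """
    rng = np.random.default_rng(seed)
    n = R.shape[0]
    Q = np.zeros((n, 2))
    lam = 0.0           # slow-timescale iterate (Whittle index estimate)
    gamma = 0.9         # discount factor (assumed for this sketch)
    s = 0
    for k in range(1, iters + 1):
        alpha = 1.0 / k**0.6   # fast step size
        beta = 1.0 / k         # slow step size, so beta/alpha -> 0
        a = int(rng.integers(2))            # uniform exploratory policy
        s_next = rng.choice(n, p=P[a][s])
        # the passive action earns the subsidy lam on top of its reward
        r = R[s, a] + (lam if a == 0 else 0.0)
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        # the slow update drives Q(s_ref, 1) - Q(s_ref, 0) toward zero
        lam += beta * (Q[s_ref, 1] - Q[s_ref, 0])
        s = s_next
    return Q, lam
```

At the estimated index, the agent is indifferent between activating and resting in the reference state, which is exactly the fixed point the slow update seeks.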
DeepTOP: Deep Threshold-Optimal Policy for MDPs and RMABs
We consider the problem of learning the optimal threshold policy for control problems. Threshold policies make control decisions by evaluating whether an element of the system state exceeds a certain threshold, whose value is determined by other elements of the system state. By leveraging the monotone property of threshold policies, we prove that their policy gradients have a surprisingly simple expression. We use this simple expression to build an off-policy actor-critic algorithm for learning the optimal threshold policy. Simulation results show that our policy significantly outperforms other reinforcement learning algorithms, owing to its ability to exploit the monotone property. In addition, we show that the Whittle index, a powerful tool for restless multi-armed bandit problems, is equivalent to the optimal threshold policy for an alternative problem. This observation leads to a simple algorithm that finds the Whittle index by learning the optimal threshold policy in the alternative problem. Simulation results show that our algorithm learns the Whittle index much faster than several recent studies that learn it through indirect means.
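The threshold structure above is easy to state concretely. A minimal sketch follows; the sigmoid relaxation is a generic illustration of why a parameterized threshold admits a simple gradient, not the exact policy-gradient expression derived in the paper.

```python
import numpy as np

def threshold_action(x, theta):
    """Threshold policy: activate (1) iff the scalar state component x
    exceeds the threshold theta, which may itself be a function of the
    remaining elements of the system state."""
    return int(x > theta)

def soft_threshold_prob(x, theta, temp=0.1):
    """Hypothetical sigmoid relaxation of the hard threshold: the
    probability of activating is smooth in theta, so its gradient is
    available in closed form. This illustrates, in spirit, why the
    monotone/threshold structure leads to simple policy gradients."""
    return 1.0 / (1.0 + np.exp(-(x - theta) / temp))
```

As `temp` shrinks, the relaxed policy approaches the hard threshold rule, and the activation decision flips monotonically as `x` crosses `theta`.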
Model-Based Learning of Whittle Indices
Charles-Rebuffé, Joël, Gast, Nicolas, Gaujal, Bruno
We present BLINQ, a new model-based algorithm that learns the Whittle indices of an indexable, communicating, and unichain Markov Decision Process (MDP). Our approach builds an empirical estimate of the MDP and then computes its Whittle indices using an extended version of an existing state-of-the-art algorithm. We prove convergence to the Whittle indices we want to learn, bound the time needed to learn them to arbitrary precision, and investigate the algorithm's computational complexity. Our numerical experiments suggest that BLINQ significantly outperforms existing Q-learning approaches in the number of samples needed to reach an accurate approximation. Moreover, its total computational cost is lower than that of Q-learning for any reasonably large number of samples. These observations persist even when the Q-learning algorithms are sped up with pre-trained neural networks that predict Q-values.
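The model-building step of a model-based learner like the one above can be sketched as maximum-likelihood estimation from observed transitions; the index computation that BLINQ runs on this estimate is omitted, and the uniform fallback for unvisited state-action pairs is an assumption of this sketch.

```python
import numpy as np

def empirical_mdp(transitions, n_states, n_actions):
    """Build an empirical MDP estimate from (s, a, r, s') samples:
    maximum-likelihood transition probabilities and mean rewards."""
    counts = np.zeros((n_actions, n_states, n_states))
    rew_sum = np.zeros((n_states, n_actions))
    rew_cnt = np.zeros((n_states, n_actions))
    for s, a, r, s_next in transitions:
        counts[a, s, s_next] += 1
        rew_sum[s, a] += r
        rew_cnt[s, a] += 1
    # empirical transition kernel; unvisited (s, a) pairs fall back
    # to a uniform distribution so every row remains stochastic
    totals = counts.sum(axis=2, keepdims=True)
    P_hat = np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / n_states)
    R_hat = np.divide(rew_sum, rew_cnt,
                      out=np.zeros_like(rew_sum), where=rew_cnt > 0)
    return P_hat, R_hat
```

Once `P_hat` and `R_hat` are accurate enough, any exact Whittle-index computation can be run on the estimated model, which is where the sample-efficiency advantage over purely model-free Q-learning comes from.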
MARBLE: Multi-Armed Restless Bandits in Latent Markovian Environment
Amiri, Mohsen, Avrachenkov, Konstantin, Mimouni, Ibtihal El, Magnússon, Sindri
Restless Multi-Armed Bandits (RMABs) are powerful models for decision-making under uncertainty, yet classical formulations typically assume fixed dynamics, an assumption often violated in nonstationary environments. We introduce MARBLE (Multi-Armed Restless Bandits in a Latent Markovian Environment), which augments RMABs with a latent Markov state that induces nonstationary behavior. In MARBLE, each arm evolves according to a latent environment state that switches over time, making policy learning substantially more challenging. We further introduce the Markov-Averaged Indexability (MAI) criterion as a relaxed indexability assumption and prove that, despite unobserved regime switches, under the MAI criterion, synchronous Q-learning with Whittle Indices (QWI) converges almost surely to the optimal Q-function and the corresponding Whittle indices. We validate MARBLE on a calibrated simulator-embedded (digital twin) recommender system, where QWI consistently adapts to a shifting latent state and converges to an optimal policy, empirically corroborating our theoretical findings.
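The latent Markovian environment described above can be sketched as a regime-switching simulator: a hidden state evolves as its own Markov chain, and the arm's transition kernel depends on the current regime. The interface and the choice to transition the regime before the arm are assumptions of this sketch, not MARBLE's published simulator.

```python
import numpy as np

def step_latent_env(s, z, P_by_regime, Z_trans, rng):
    """One step of a restless arm in a latent Markovian environment.

    s            : current arm state
    z            : current (hidden) regime
    P_by_regime  : list of per-regime transition matrices P_z[s, s']
    Z_trans      : regime-switching matrix Z_trans[z, z']
    """
    # the latent regime switches first, unobserved by the learner
    z_next = rng.choice(len(Z_trans), p=Z_trans[z])
    # the arm then evolves under the new regime's kernel
    s_next = rng.choice(P_by_regime[z_next].shape[1], p=P_by_regime[z_next][s])
    return s_next, z_next
```

A learner such as QWI only observes the sequence of arm states `s`, so the regime switches manifest as apparent nonstationarity, which is precisely the setting the Markov-Averaged Indexability criterion is meant to handle.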